ABSTRACT

In today's competitive big data era, parties need to understand global trends and patterns for
business, research and other decision-making purposes, and hence need to share data. Data sharing
invites a number of privacy threats, as the shared data may contain sensitive information that must not
be disclosed to others. A number of techniques exist that preserve privacy by masking sensitive
information. Heuristic approaches to hiding sensitive patterns are simple, fast and widely adopted.
However, with the explosive growth of data and the sequential nature of conventional techniques,
the issue of scalability arises. Parallelization is a promising way to improve the scalability of
existing sensitive data hiding techniques. MapReduce is a parallel programming
framework that uses the power of several distributed machines to achieve parallelization. Here, we
propose an improved and scalable two-phase MapReduce version of a primarily heuristics-based
approach that achieves scalability while maintaining a good balance between privacy and knowledge
content. A number of experiments are conducted to evaluate the performance in terms of the execution
time taken to sanitize voluminous data. The results show that the MapReduce version
outperforms the sequential approaches and can handle large-scale data in an efficient and scalable fashion.

1. INTRODUCTION 

Sharing and analyzing datasets is essential to understanding global patterns for decision making
in business, research, etc. Sharing data increases the risk of privacy leakage, as the data may
contain sensitive information [1]. Hence, sensitive information needs to be removed or masked before
sharing with others. The explosive growth of data has made masking the sensitive information present
in a dataset a challenging task. A number of techniques to preserve the privacy of data exist. These
techniques can be broadly divided into three major categories [2]: heuristic approaches, border-based
approaches and exact approaches. Heuristic approaches provide the simplest and fastest mode of
sanitization but with low data quality. Border-based approaches have high computational requirements,
but the data quality achieved is good in comparison to heuristic approaches. Lastly, exact techniques
provide the optimal solution for masking the sensitive information, but their computational complexity
is too high.

Heuristic approaches are simple, fast and widely adopted techniques for hiding sensitive itemsets [3].
They provide a good level of privacy by hiding sensitive information, and some of them also attempt to
maintain good knowledge content within the sanitized dataset. Disclosure limitation of sensitive rules
[1] is one of the primary works; it selects the victim item by traversing the dataset in a typical
heuristic manner. The authors created a lattice-like structure of the data and proposed a greedy
traversal to find the maximum-support victim item for each sensitive rule. Other heuristic approaches,
such as multiple rule hiding [4], masking of sensitive information in round-robin fashion [5], and the
aggregation and disaggregation concept for hiding sensitive rules [6], attempted to improve on [1] by
reducing the impact on non-sensitive data and improving the fairness of selecting the victim item and
transaction. However, most of these traditional approaches are sequential in nature.


With the explosive growth of data, these conventional techniques perform inefficiently, and their
sequential nature acts as a bottleneck while processing large-scale datasets. To preserve privacy and
data quality while handling such large-scale data, these techniques must be improved
so that they can efficiently process the provided dataset irrespective of its size and volume. Greedy
selection of transactions, simple traversal techniques, fast victim selection and parallelization of
traditional approaches are some of the promising solutions for achieving the goals stated above.

The MapReduce parallel programming framework [7] achieves parallelization by
distributing the computation over n nodes. The nodes involved in a computation
are combined to form a cluster. Simplicity and fault tolerance are the key features that attract
applications requiring high computation power to accomplish their tasks in an easy and
efficient fashion. A MapReduce job mainly consists of two tasks, called Map and Reduce. Map takes
input in (key1, value1) form and produces intermediate results of the form (key2, value2), which are
then sorted, shuffled and provided as input to the reducers. Pairs with the same key are provided to
the same reducer. Finally, the reducer computes over and merges its input and produces output in the
form (key3, value3).
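
The Map, shuffle and Reduce flow described above can be illustrated with a small in-process simulation. This is only a sketch under stated assumptions: it runs on one machine rather than on a cluster, and the names `map_phase`, `shuffle`, `reduce_phase`, `item_mapper` and `sum_reducer` are hypothetical helpers, not part of any MapReduce library. The example job counts how often each item occurs across a few toy transactions.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every (key1, value1) record and collect (key2, value2) pairs."""
    intermediate = []
    for key1, value1 in records:
        intermediate.extend(mapper(key1, value1))
    return intermediate


def shuffle(intermediate):
    """Group values by key and sort the keys, so each key reaches exactly one reducer call."""
    groups = defaultdict(list)
    for key2, value2 in intermediate:
        groups[key2].append(value2)
    return dict(sorted(groups.items()))


def reduce_phase(groups, reducer):
    """Call the reducer once per key and collect the (key3, value3) outputs."""
    return [reducer(key2, values) for key2, values in groups.items()]


def item_mapper(tid, items):
    # (transaction id, item list) -> one (item, 1) pair per item
    return [(item, 1) for item in items]


def sum_reducer(item, counts):
    # (item, [1, 1, ...]) -> (item, total count)
    return (item, sum(counts))


# Toy input: (transaction id, items) records.
transactions = [(1, ["a", "b"]), (2, ["a", "c"]), (3, ["a", "b", "c"])]
item_support = reduce_phase(shuffle(map_phase(transactions, item_mapper)), sum_reducer)
print(item_support)
```

In a real deployment the three stages would run on different nodes; here they are plain function calls so that the data flow between the key/value forms is easy to follow.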

Here, we propose a scalable and improved two-phase MapReduce-based version of an existing
heuristic approach [1], which can efficiently hide sensitive information, maintain good data quality
and handle huge datasets in a much more scalable fashion. We deliberately designed a number of
MapReduce jobs out of the existing algorithm and implemented them to obtain a scalable two-phase
MapReduce version.

The organization of this paper is as follows. In Section 2, we discuss some closely related
work and define the problem undertaken. In Section 3, some preliminaries are discussed. In
Section 4, we propose the scalable, improved two-phase MapReduce version of the heuristic
approach. Experimental evaluation and results are discussed in Section 5. Finally, Section 6 draws the
conclusion.

2. RELATED WORK 

Privacy preservation techniques can be broadly divided into two categories. The first is privacy
preservation for outsourced data, where data needs to be protected when stored at a third party, e.g.
access control. The second is privacy preservation before data is shared with any other party, e.g.
sensitive pattern hiding. Some work on scalability has been done in the first category, where
parallelization is introduced in anonymization and encryption. A parallel anonymization technique
called the top-down specialization approach is introduced in [8] using the MapReduce framework. Because
it is a MapReduce version of the traditional anonymization approach, it can handle large-scale data
in a scalable fashion, which makes outsourcing of data secure and scalable. ElGamal
encryption is proposed in [9] to encrypt data and restrict access to it over the cloud. A fast parallel
algorithm called EFPA over the cloud is introduced in [10]. The approach allows a number of parties to
combine their data and mine global rules and patterns without any data leakage. The EFPA protocol
ensures the privacy of data by mining the encrypted data of the data providers, and it also has a low
computational cost. All of the above work falls under the first category, where scalability is achieved
for fast privacy preservation of outsourced data. The concept of a hybrid cloud is introduced in [11]
to access data over the cloud in a privacy-preserving fashion. A private cloud is introduced as an
interface between the client and the public cloud where the data is stored. Services such as keyword
searching and data access over encrypted data are provided. A privacy-preserving layer is introduced
in [3], which helps in sanitizing data before uploading it to the cloud. The layer is flexible, dynamic
and scalable, and is built over the MapReduce framework. However, techniques to hide sensitive
co-occurring patterns are still sequential. These approaches need to be improved, as they can help in
hiding all the sensitive information before we share data with others.

3. PRELIMINARIES 


In this section, we will discuss the basic terms and definitions. 


3.1 Basic Terminology 

• Frequent Item- An item i ∈ D is considered frequent if its support is greater than or equal to the
defined minimum support threshold, i.e. supp(i) ≥ Th_supp.

• Support Indexing- A table containing the occurrence frequency (support) of every 1-frequent item i
present in the dataset, i.e. a mapping i → supp(i) for every i ∈ D that is 1-frequent.

• Related File- For each sensitive itemset s ∈ S, an entry is made showing its current support and the
further reduction required, i.e. (current_support - Threshold).

• Sensitive Information- Co-occurring patterns (itemsets) that, when they occur together, reveal
information which a client or a firm does not want to reveal are considered sensitive.

• Data Quality- After sanitizing the dataset, i.e. hiding all sensitive information, the modified data
must retain a good balance between privacy and knowledge content, such that its analysis still produces
a set of useful association rules.

• Supporting Transaction- A transaction T ∈ D is called supporting if it contains one or more sensitive
itemsets.

• Transaction Length- The total number of items present in a transaction is considered the length of
the transaction.

• Scalability- Scalability may be defined as the ability to handle large-scale datasets while
maintaining good performance in terms of execution time.
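
To make these definitions concrete, the following sketch computes a support index, a related-file entry and the supporting-transaction test on a toy dataset. The function names and the toy data are assumptions made for illustration only. The reduction required follows the definition above (current_support - Threshold); if an itemset whose support equals the threshold should still count as frequent, one more reduction would be needed.

```python
from collections import Counter


def support_index(dataset, th_supp):
    """Support of every 1-frequent item, i.e. every item with support >= th_supp."""
    counts = Counter(item for transaction in dataset for item in set(transaction))
    return {item: count for item, count in counts.items() if count >= th_supp}


def itemset_support(dataset, itemset):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for transaction in dataset if set(itemset) <= set(transaction))


def related_file(dataset, sensitive_itemsets, th_supp):
    """Map each sensitive itemset to (current support, reduction required)."""
    entries = {}
    for s in sensitive_itemsets:
        current = itemset_support(dataset, s)
        # Reduction as defined in the text: current_support - Threshold (never negative).
        entries[frozenset(s)] = (current, max(current - th_supp, 0))
    return entries


def is_supporting(transaction, sensitive_itemsets):
    """A transaction is supporting if it contains at least one sensitive itemset."""
    return any(set(s) <= set(transaction) for s in sensitive_itemsets)


# Toy dataset D and one sensitive itemset (both hypothetical).
D = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["b", "d"], ["a", "b", "d"]]
S = [{"a", "b"}]

index = support_index(D, th_supp=3)
related = related_file(D, S, th_supp=2)
lengths = [len(t) for t in D]  # transaction length = number of items
print(index, related, lengths)
```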

3.2 Background Discussion 

Disclosure limitation of sensitive rules [1] is one of the primary, simple and basic heuristic
techniques; it uses a greedy heuristic traversal to hide sensitive rules. The algorithm selects the
large itemset of each rule as the sensitive itemset. All the itemsets selected for the rules are sorted
by their support. For each sensitive itemset s ∈ S, the algorithm traverses to its immediate subsets
until it reaches the highest-support 1-frequent item. This item is selected as the victim item to be
removed. A transaction supporting the itemset is identified, the selected victim item is removed from
it, and changes are made to any other item or itemset in the graph that is affected. This hides all the
sensitive itemsets one by one. The approach is simple and fast, but comes with the following issues:

• Scalability: The techniques process data in sequential fashion and therefore have the issue of
scalability.

• Sequential approach: The basic approach masks the sensitive rules one by one, and
transactions are modified on a first come, first served basis.

• Computational complexity: Separate batches of supporting transactions for each sensitive
itemset are prepared and updated repeatedly for each modification.

A number of data scans are required while propagating the impact of any modification. All of the above
issues call for improvement of the existing approaches. Hence, we deliberately design multiple Map and
Reduce jobs to achieve task parallelization and the ability to handle large-scale data effectively and
efficiently. We improved the approach to hide sensitive rules in parallel and, instead of handling
transactions on a first come, first served basis, we make a greedy transaction selection, i.e. the
transaction with minimum length is sanitized first. This helps in reducing the hiding effect on
non-sensitive information. To reduce the number of database scans and the propagation complexity, our
approach does not make batches of supporting transactions against each sensitive itemset but separates
the sensitive transactions from the non-sensitive transactions and then divides them into small data
chunks. These data chunks are then distributed over n nodes to achieve parallelization, and each
transaction is sanitized independently. Avoiding batch formation removes the need to propagate the
effect of any modification and hence reduces the computational complexity. Thus, our work provides an
improved, scalable heuristic approach over the MapReduce framework.

3.3 MapReduce 

MapReduce is a parallel programming framework [7] with immense computational power. The
framework distributes the task over n widely distributed computation nodes to achieve parallelization.
It is simple, flexible and fault-tolerant, and hides the complexity of any failure that occurs. These
key features make it a first choice for achieving scalability. MapReduce is composed of two main
jobs, Map and Reduce. It takes its input and produces its output as (key, value) pairs.
Figure 1 shows one phase of a MapReduce computation. The number of mappers and reducers running to
process a dataset measures its job-level parallelization, and the number of subroutines achieves
task-level parallelization [8]. MapReduce divides the dataset into a number of data chunks
D = {d1, d2, ..., dn} and distributes them over n nodes. The number of data chunks depends on the
initial size of the file. By default, the data chunk size is 64 MB or 128 MB. If the file is smaller
than the default value, no division occurs. MapReduce replicates each chunk over three nodes, called
data nodes, to handle failures. We used the abundant computation power of the MapReduce framework and
propose a fully parallelized, efficient and scalable MapReduce version of our approach.
Figure 1: Single phase MapReduce Computation [13]
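
The relation between file size, chunk size and replication described in this section is simple arithmetic, sketched below. The helper names `num_chunks`, `stored_copies` and `split_into_chunks` are hypothetical, and the chunk size and replication factor are parameters taken from the defaults quoted above rather than values read from any real cluster configuration.

```python
import math


def num_chunks(file_size_mb, chunk_size_mb=128):
    """Number of data chunks for a file; a file smaller than one chunk is not divided."""
    return max(1, math.ceil(file_size_mb / chunk_size_mb))


def stored_copies(file_size_mb, chunk_size_mb=128, replication=3):
    """Total chunk copies held by the data nodes when every chunk is replicated."""
    return num_chunks(file_size_mb, chunk_size_mb) * replication


def split_into_chunks(transactions, chunk_len):
    """Split a list of transactions into consecutive chunks of at most chunk_len items."""
    return [transactions[i:i + chunk_len] for i in range(0, len(transactions), chunk_len)]


print(num_chunks(1024))              # 1 GB file, 128 MB chunks
print(num_chunks(3072, 64))          # 3 GB file, 64 MB chunks
print(num_chunks(50))                # smaller than one chunk: no division
print(stored_copies(1024))           # copies stored with three-way replication
print(split_into_chunks(list(range(7)), 3))
```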

4. PROPOSED PARALLELIZED HEURISTICS BASED APPROACH 

As discussed in Section 3.2, we require a threefold improvement in the basic approach. Here we
discuss the improvements and the MapReduce implementation that preserve privacy with good data quality,
scalability and reduced computational complexity. Firstly, the traditional approach creates a
lattice-like structure of the dataset, and each victim deletion requires multiple lattice traversals to
propagate the effect of the modified transaction with respect to every sensitive itemset. Our approach
instead maintains an index of the support of each frequent item i ∈ D. The proposed approach also
sanitizes sensitive rules and transactions in parallel instead of one by one. Another list is
maintained with one entry of three fields, i.e. (s, sup(s), tran(t)), for each restricted rule, where
sup(s) is the support of s and tran(t) is the number of transactions that still need to be modified in
order to make the rule infrequent. After every removal of an item, instead of traversing the whole
lattice to propagate the modification effect, only these two files are updated. This helps in
decreasing the computation cost in the following ways:

• We need not create and update a batch of supporting transactions for each sensitive rule every
time. A transaction is considered for sanitization only once, meaning that all the
sensitive itemsets it supports are masked in a single attempt.

• The support index and the list for each restricted rule are created initially, and all modifications
are propagated directly to them by simply decreasing the values for the rule or item, which is
much less computationally demanding than lattice traversal.

• No batches of transactions need to be created and maintained for each restrictive
pattern, which reduces the number of database scans required. The
number of scans required by the proposed approach is only two.

• A transaction is sanitized with respect to all the frequent sensitive itemsets in one round only;
hence, there is no issue of effect propagation and no transaction needs to be added or removed,
which ultimately reduces the computational complexity.

• Data sanitization is done in distributed fashion using the MapReduce framework, and all the
sensitive itemsets a transaction supports are sanitized at once, which makes the approach
fast and scalable.


The improved version of the existing traditional approach is a simple, fast and scalable way to
sanitize large-scale datasets. However, the problem of maintaining good data quality in the sanitized
dataset is still unaddressed. In the above approach, the transactions to be modified are chosen on a
first come, first served basis, meaning that the first transaction supporting the sensitive itemset is
selected without considering the side effect on the non-sensitive information present in the dataset.
Therefore, if we choose transactions carefully, it is possible to maintain good data quality. Several
approaches use the concepts of 'supporting transaction separation' [6] and 'minimum length sanitization
first' [5] in order to reduce the size of the dataset that needs to be scanned, as well as to maintain
good data quality. A dataset may contain transactions that do not support any of the sensitive rules,
and processing them is an overhead. Hence, if we separate them out in the initial phase, we save time
as well as computation cost. Secondly, supporting transactions also contain information that is not
restricted at all, but removal of a sensitive item may adversely affect it. This is where the minimum
length transaction concept helps. If the length of a transaction, i.e. the number of items in
transaction T, is small, then the probability of non-sensitive information being present in it is also
small. Therefore, instead of sanitizing a random supporting transaction, we sanitize the minimum length
transaction first. In this way promising data quality can be maintained. Next, we discuss the
MapReduce implementation of the above improved heuristic concept.
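
The intuition behind sanitizing the minimum length transaction first can be illustrated with a rough side-effect count. The measure used here, the number of two-item co-occurrences destroyed by removing the victim that are not part of the sensitive itemset, is an assumption chosen for illustration and is not a metric defined in this paper; `collateral_pairs` is likewise a hypothetical helper.

```python
def collateral_pairs(transaction, victim, sensitive_itemset):
    """Count the pairs {victim, item} lost by removing the victim that are
    not contained in the sensitive itemset (an illustrative side-effect measure)."""
    sensitive = set(sensitive_itemset)
    return sum(
        1
        for item in set(transaction)
        if item != victim and not {victim, item} <= sensitive
    )


sensitive = {"a", "b"}
short_t = ["a", "b", "c"]
long_t = ["a", "b", "c", "d", "e", "f"]

# Sorting by length puts the cheaper transaction first.
candidates = sorted([long_t, short_t], key=len)
costs = [collateral_pairs(t, "a", sensitive) for t in candidates]
print(candidates[0], costs)
```

Under this measure, removing the victim from the three-item transaction disturbs one non-sensitive pair, while removing it from the six-item transaction disturbs four.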

We deliberately divided the algorithm into Map and Reduce phases. Firstly, data partitioning is invoked
and the dataset is divided into a number of data chunks D = {d1, d2, ..., dn}, distributed over a
number of computing nodes. In Phase I, a data chunk (di) and the set of sensitive itemsets
S = {s1, s2, ..., sm} are provided as input. Transaction separation is invoked for (di), and the victim
item (v) for each sensitive itemset is selected. A length calculator also runs to find the length of
each transaction, producing (Tid, len) as output. MapReduce sorts the intermediate set of supporting
transactions D" in ascending order of length. D" is then partitioned into data chunks
D" = {d1", d2", ...} and provided as input to Phase II. The MapReduce phases of the algorithm can be
explained as below:

Phase I 

Input- Dataset D, Set of sensitive itemsets S. 

Output- Victim item (v) against each sensitive itemset s.

Step 1: Read the dataset and, for each transaction T ∈ D, identify whether it supports any sensitive
rule. If not, add T directly to the sanitized dataset D*.

Step 2: Count the total number of items i ∈ T to obtain the length of the transaction, calculate the
support of each item and add it to the index file.

Step 3: For each sensitive rule, calculate the current threshold th and the number of transactions that
need to be sanitized to hide the rule, and add them to the list against the corresponding sensitive rule.
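
The three steps of Phase I can be sketched as a per-chunk map function followed by a merging reduce function. This is a single-process illustration under several assumptions: `phase1_map` and `phase1_reduce` are hypothetical names, item support is counted over all transactions, sensitive itemsets are passed as frozensets, and the number of transactions to sanitize is taken as (current support - threshold), following the Related File definition in Section 3.1.

```python
from collections import Counter


def phase1_map(chunk, sensitive_itemsets):
    """Process one data chunk of (tid, items) records and return partial results."""
    non_sensitive, supporting = [], []
    item_counts, itemset_counts = Counter(), Counter()
    for tid, items in chunk:
        item_set = set(items)
        item_counts.update(item_set)                      # Step 2: item support
        supported = [s for s in sensitive_itemsets if s <= item_set]
        if not supported:
            non_sensitive.append((tid, items))            # Step 1: goes straight to D*
            continue
        itemset_counts.update(supported)                  # Step 3: itemset support
        supporting.append((len(item_set), tid, items))    # Step 2: transaction length
    return non_sensitive, supporting, item_counts, itemset_counts


def phase1_reduce(partials, th_supp):
    """Merge per-chunk results; sort supporting transactions by ascending length."""
    non_sensitive, supporting = [], []
    index, itemset_support = Counter(), Counter()
    for ns, sup, items, itemsets in partials:
        non_sensitive.extend(ns)
        supporting.extend(sup)
        index.update(items)
        itemset_support.update(itemsets)
    supporting.sort()  # (length, tid, items): shortest transactions first
    # Related list entry: [current support, transactions still to be sanitized].
    related = {s: [c, max(c - th_supp, 0)] for s, c in itemset_support.items()}
    return non_sensitive, supporting, dict(index), related


D = [(1, ["a", "b", "c"]), (2, ["a", "b"]), (3, ["c", "d"]),
     (4, ["a", "b", "d", "e"]), (5, ["b", "c"])]
S = [frozenset({"a", "b"})]
chunks = [D[:3], D[3:]]                       # two toy data chunks
partials = [phase1_map(chunk, S) for chunk in chunks]
non_sensitive, supporting, index, related = phase1_reduce(partials, th_supp=2)
print(supporting, related)
```

Each call to `phase1_map` touches only its own chunk, which is what would allow the chunks to be processed on different nodes; only the small counters and lists are merged afterwards.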


Phase II 

Input- Sorted data chunk (di"), the set of sensitive patterns S, victim item v corresponding to each
sensitive pattern.

Output- Sanitized dataset D* with all the sensitive information masked.

Divide the sensitive data D" = {d1", d2", ...} into a number of data chunks, distribute them over a
number of distributed nodes, and perform the following steps:

Step 1: For each sensitive itemset s ∈ S, find the item i ∈ s carrying the maximum support and select
it as the victim item. Arrange the transactions in the sensitive dataset in increasing order of length
(Len). Thus, we sanitize the minimum length transactions first, which minimizes the impact on the
sanitized dataset.

Step 2: For every transaction, identify the sensitive itemsets it supports and check in the list
whether each sensitive itemset is already hidden. If not, remove the victim item corresponding to the
rule from the transaction. Every time we remove a victim item, we propagate the effect directly to the
support index and to the list for each restrictive rule, i.e. we reduce the support of the deleted
victim item and, if any other sensitive rule supported by the transaction is affected, reduce
its support too in its related list.

Step 3: Repeat until all the sensitive rules are masked. The set of all modified transactions is merged
with the non-sensitive transactions identified initially to generate the final sanitized dataset D*.
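
A minimal sketch of the Phase II steps on a single data chunk is given below. It is an illustration under assumptions, not the implementation evaluated in Section 5: `select_victims` and `sanitize` are hypothetical names, items are assumed to be unique within a transaction, and the sketch runs sequentially, so it does not show how the support index and related list would be coordinated between nodes.

```python
def select_victims(sensitive_itemsets, support_index):
    """Step 1: for each sensitive itemset, pick the member item with maximum support."""
    return {
        s: max(sorted(s), key=lambda item: support_index.get(item, 0))
        for s in sensitive_itemsets
    }


def sanitize(supporting, victims, support_index, remaining):
    """Step 2: sanitize (tid, items) records, shortest first.

    remaining maps each sensitive itemset to the number of supporting
    transactions that still have to be modified. support_index and
    remaining are updated in place (direct effect propagation)."""
    sanitized = []
    for tid, items in sorted(supporting, key=lambda record: len(record[1])):
        items = list(items)
        for s, victim in victims.items():
            if remaining[s] <= 0 or not s <= set(items):
                continue  # already hidden, or no longer supported by this transaction
            before = set(items)
            items.remove(victim)
            support_index[victim] -= 1
            # Every sensitive itemset that this transaction supported and that
            # contains the victim loses one supporting transaction.
            for other in victims:
                if victim in other and other <= before:
                    remaining[other] -= 1
        sanitized.append((tid, items))
    return sanitized


# Toy inputs (hypothetical): two more supporting transactions must be modified.
S = [frozenset({"a", "b"})]
support_index = {"a": 3, "b": 4, "c": 3, "d": 2, "e": 1}
remaining = {S[0]: 2}
supporting = [(1, ["a", "b", "c"]), (2, ["a", "b"]), (4, ["a", "b", "d", "e"])]
non_sensitive = [(3, ["c", "d"]), (5, ["b", "c"])]

victims = select_victims(S, support_index)
sanitized = sanitize(supporting, victims, support_index, remaining)
final_dataset = sorted(non_sensitive + sanitized)   # Step 3: merge into D*
print(victims, final_dataset)
```

In this toy run the two shortest supporting transactions lose the victim item, and the longest one is left untouched because the itemset is already hidden by the time it is reached.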

5. EXPERIMENTS AND RESULTS 

5.1 Overall Evaluation 

As discussed in Section 4, we improved the basic heuristic approaches [9] in order to propose a
scalable, simple and fast heuristics-based approach that maintains good data quality while hiding all
sensitive information present in the dataset. Instead of building a lattice structure of the data and
traversing it to find the victim candidate, a support index is maintained, which can be accessed
directly to find the support of item i. This reduces the computational complexity and makes the
approach simple and fast. Initially, supporting transactions are filtered out in order to reduce the
size of the dataset, which also makes the approach fast. Instead of taking the supporting transactions
on a first come, first served basis, we sort the transactions in increasing order of their length to
sanitize the minimum length transactions first. This helps in decreasing the side effect of hiding on
non-sensitive information in the dataset. All sensitive itemsets are masked in parallel instead of one
by one. A related list for each restricted rule is maintained and updated directly for every
modification, so there is no need to propagate the modification effect by traversing the lattice. The
data is partitioned into a number of small data chunks, which are distributed over different computing
nodes and sanitized in parallel using the MapReduce framework, which ultimately achieves speedup and
scalability.

5.2 Experimental settings 

We compared the proposed MapReduce version of the approach with the sequential version on synthetic
data generated using the IBM Quest Synthetic Data Generator. The data size varies from 5 million to 100
million transactions (1 GB-3 GB). We implemented both versions of the technique in Java on an Ubuntu
machine using Eclipse, and the MapReduce version on a single-node Hadoop platform. We
compared the performance of the MapReduce version and the sequential approach in three major
directions. The first set of experiments shows the effect of varying data size on the time taken to
mask all sensitive information present. The second set of experiments evaluates the time taken to
sanitize the given dataset when the percentage (%) of sensitive content present in the dataset varies.
The last set shows the effect of varying the minimum support threshold.

5.3 Results and Discussion 
1. Effect of Varying Data Size 

In Figure 2, it can be clearly seen that, as the data size varies, the time taken to sanitize the
dataset using the sequential traditional approach is much higher than that of the proposed MapReduce
version. The traditional approach requires more database scans and masks sensitive rules one by one
using the lattice structure; hence, it takes more time to perform sanitization. Secondly, the improved
version uses support indexing, which removes the need for multiple dataset scans and for multiple
lattice traversals for effect propagation. Therefore, performance improves. Finally, the
MapReduce version sanitizes all the sensitive rules in parallel and also achieves data
parallelization. Hence, the MapReduce version can sanitize large-scale datasets in a much more
efficient and scalable fashion.
Figure 2: Effect of Varying Data Size on Execution Time
2. Effect of Varying Sensitive Content in Dataset 

With a varying percentage of sensitive information present in the dataset, the sequential approach
requires much more time than the proposed parallel and improved versions. This is because, with a
larger number of sensitive rules, the traditional approach needs more transaction ID lists, more
traversals, etc., whereas the MapReduce version achieves both transaction parallelization and sensitive
rule parallelization. Accordingly, Figure 3 clearly shows that the execution time taken by the proposed
approach to sanitize the sensitive content is much less than that of the conventional approach. Hence,
the proposed approach is much more scalable.


Figure 3: Effect of Varying Sensitive Content in Dataset


3. Effect of Varying Support Threshold Over Execution Time 

With an increasing minimum support threshold, the amount of frequent information decreases;
hence, the amount of sensitive information is also smaller. Therefore, it can be clearly seen that with
increasing support threshold the execution time for sanitizing the sensitive information decreases.
Secondly, the performance difference between the sequential, improved and parallel versions can also be
seen. It is quite clear that the MapReduce version can sanitize the sensitive information in a much
more scalable fashion than the sequential approach.


Figure 4: Effect on Execution Time With Varying Minimum Support Threshold
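
The claim that a higher minimum support threshold leaves less frequent information to examine can be checked on a toy dataset. The sketch below counts frequent itemsets of size one and two by brute force; the dataset, the size limit and the helper name `frequent_itemsets` are assumptions made for illustration and are unrelated to the synthetic data used in the experiments.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(dataset, th_supp, max_size=2):
    """Return all itemsets of up to max_size items whose support is >= th_supp."""
    counts = Counter()
    for transaction in dataset:
        items = sorted(set(transaction))
        for k in range(1, max_size + 1):
            counts.update(combinations(items, k))
    return {itemset for itemset, count in counts.items() if count >= th_supp}


dataset = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["b", "d"], ["a", "b", "d"]]
thresholds = (1, 2, 3, 4)
sizes = [len(frequent_itemsets(dataset, th)) for th in thresholds]
print(dict(zip(thresholds, sizes)))
```

The count of frequent itemsets never increases as the threshold rises, so fewer candidate sensitive patterns remain to be hidden.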

6. CONCLUSION 

Data sharing and analysis to understand market trends is a necessary and unavoidable step in
business. However, sharing data may lead to leakage of sensitive information. Therefore, privacy
preservation is a prime challenge in today's big data era. Conventional data hiding approaches suffer
from the issue of scalability. With the explosive growth of data, the performance of these approaches
degrades in terms of execution time. Hence, we proposed a parallel version of a sequential heuristic
approach using the MapReduce framework by designing multiple Map and Reduce jobs. A number of
experiments were conducted to compare the approaches with varying data size, percentage of sensitive
information and minimum support threshold. The results show that the MapReduce version achieves good
speedup and also outperforms the sequential approach in terms of scalability.